MetaRank Usage Instructions
MetaRank is a Shiny-based application designed for non-parametric meta-analysis of ranked gene lists. It allows users to combine multiple pre-ranked gene lists into a single consensus ranked list using robust statistical methods (RankProd and RobustRankAggreg). In addition, MetaRank allows functional interpretation by performing overrepresentation analysis (ORA) on the top ranked genes (up to the top 100) of the consensus list.
This comprehensive tutorial provides step-by-step instructions on how to load ranked lists, customise ranking parameters, and visualise or interpret biologically enriched terms from the resulting consensus ranking.
- Overview
- Meta-analysis Options (RankProd and RobustRankAggreg)
- Rank Product (RankProd)
- Robust Rank Aggregation (RRA)
- Summary table
- RankProd Workflow
- Inputs
- Parameters
- Outputs
- RobustRankAggreg Workflow
- Inputs
- Parameters
- Outputs
- Shared Elements
- Shared Plots
- Shared Enrichment Analysis
- Data Visualization Tab
- Background Pipeline
- Best Practices
- Troubleshooting
1 Overview
MetaRank is a user-friendly Shiny application designed to unify ranked gene lists from multiple studies and extract a consensus ranking of the most consistently relevant genes. It provides two distinct analytical workflows, Rank Product (RP) and Robust Rank Aggregation (RRA), allowing users to choose the method that best fits their data structure and research goals. The app includes a rich set of features for data input, configuration, enrichment, and visualization. Its key features include:
Choice of meta-analysis algorithm:
- Rank Product (RP): A weighted approach that incorporates both gene rankings and associated scores, such as pvalues or fold changes.
- Robust Rank Aggregation (RRA): A non-weighted method based on rank positions only, suitable for plain ordered gene lists.
Flexible input options and parameter settings tailored to each method:
- RP mode:
  - Supports both file upload and text paste, using "###" to separate lists.
  - Accepts .TXT, .CSV, and .TSV files containing genes with numerical values.
  - Includes example datasets for quick testing and file downloads.
  - Offers a choice between basic and advanced Rank Product functions.
  - Customizable settings for handling NA values and filtering genes by minimum list appearance.
- RRA mode:
  - Accepts plain-text files or pasted input with one gene per line, using "###" to separate lists.
  - Only .txt format is supported to avoid structure conflicts.
  - Provides example data and downloadable templates.
  - Includes several aggregation options such as “RRA”, geometric mean, median or minimum rank.
  - Also allows configuration of NA handling and list inclusion thresholds.
Post-ranking functional enrichment:
- Enables Over-Representation Analysis (ORA) on the top-ranked genes (selectable from 10 to 100).
- Compatible with Gene Ontology (GO), KEGG, and Reactome databases.
- Supports multiple organisms: Homo sapiens, Mus musculus, and Rattus norvegicus.
- Accepts gene identifiers in SYMBOL, ENTREZID, and ENSEMBL formats.
Interactive visualization and result export:
- Explore input overlap with heatmaps and UpSet plots.
- View enriched terms using interactive bar plots and dot plots.
- All result tables are interactive and downloadable in .CSV or .TSV formats.
- Customise each graph and table in the Data Visualisation tab.
Figure 1: MetaRank overview
2 Meta-analysis Options
MetaRank provides two robust statistical methods for integrating ranked gene lists from multiple studies: Rank Product (RankProd) and Robust Rank Aggregation (RRA). These methods are designed to identify genes that consistently appear at the top of ranked lists, thereby highlighting potential candidates for further biological investigation.
2.1 Rank Product (RankProd)
The Rank Product method is a non-parametric statistical approach for identifying differentially expressed genes based on the consistency of gene rankings across multiple datasets. It is particularly suitable for meta-analyses that combine results from different studies, as it does not rely on data normality and is relatively robust to outliers.
Key features:
- Non-parametric analysis: Does not assume any particular data distribution, allowing application to heterogeneous datasets.
- Geometric mean aggregation: Ranks are combined using the geometric mean, giving greater weight to genes consistently ranked at the top.
- False discovery rate estimation: Provides an estimate of the proportion of false positives (pfp) to evaluate the statistical significance of the results.
- Cross-platform applicability: Designed to integrate data from diverse experimental conditions or technologies.
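The geometric mean aggregation mentioned above can be illustrated with a minimal Python sketch. It is only an illustration of the idea, not the RankProd implementation (which also estimates significance by permutation); the gene names and ranks are made up, and `rank_product` is a hypothetical helper name.

```python
from math import prod


def rank_product(ranks):
    """Geometric mean of a gene's ranks across lists (rank 1 = best)."""
    return prod(ranks) ** (1.0 / len(ranks))


# Hypothetical ranks of three genes in three input lists.
gene_ranks = {
    "GENE_A": [1, 2, 4],     # consistently near the top
    "GENE_B": [1, 50, 100],  # top of one list only
    "GENE_C": [10, 10, 10],  # consistently mid-ranked
}

rp = {gene: rank_product(ranks) for gene, ranks in gene_ranks.items()}

# Lower rank product = better consensus position.
consensus = sorted(rp, key=rp.get)
print(consensus)
```

A gene that is first in a single list but poorly ranked elsewhere (GENE_B) ends up below a gene that is consistently well ranked, which is the behaviour the feature list describes.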
Recommended use cases:
- Works best with complete and balanced gene lists, where most genes are consistently represented across all datasets.
- Suitable when datasets have similar quality and measurement platforms, minimizing unwanted variability.
- Performs optimally with a moderate number of lists (approximately 5 to 20), ensuring a balance between sensitivity and computational cost.
- Less effective in the presence of high proportions of missing values or when gene representation is inconsistent, although these limitations can be mitigated using filtering based on minimum gene appearance and applying penalization strategies.
- Execution time is relatively high, especially when using permutation-based significance testing on large datasets.
- May be moderately influenced by outliers, particularly in smaller datasets with high variability.
The Rank Product method is implemented in the Bioconductor package RankProd, which provides functions for performing the analysis and visualizing the results.
2.2 Robust Rank Aggregation (RobustRankAggreg)
Robust Rank Aggregation (RRA) is a probabilistic method designed to identify genes that are consistently ranked higher than expected by chance across multiple input lists. It is particularly effective in scenarios involving noisy data, incomplete lists, or substantial variability among datasets.
Key features:
- Probabilistic modeling: Computes p-values by modeling the probability of observing gene rankings under a null model.
- Robustness to noise and variability: Maintains performance in the presence of random noise, outliers, or inconsistencies between lists.
- Adaptable to varying list lengths: Can accommodate lists of different sizes without requiring imputation or alignment.
- No parameter tuning required: Offers a straightforward implementation without the need for user-defined parameters.
Recommended use cases:
- Appropriate for heterogeneous datasets obtained from different experimental conditions, platforms, or studies.
- Particularly suitable when gene lists are incomplete or vary significantly in content and length.
- Scales efficiently with a large number of input lists (more than 20), taking advantage of increased data diversity to improve robustness.
- Demonstrates high resistance to noise, performing reliably even if some input lists contain irrelevant or partially random data.
- Does not consider the magnitude of expression differences, focusing solely on rank order.
- Assumes independence among ranked lists, which may not always be valid in certain experimental designs.
- The interpretation of significance scores may be less intuitive due to the probabilistic nature of the method.
The RRA method is implemented in the CRAN package RobustRankAggreg, which offers functions for list aggregation and significance estimation.
2.3 Summary table
| Feature | Rank Product | Robust Rank Aggregation |
|---|---|---|
| Data completeness | Requires complete gene presence across lists | Supports partial and incomplete lists |
| List consistency | Performs best with uniform list lengths | Handles varying list lengths and contents |
| Handling of missing data | Limited unless filtered or penalized | Naturally tolerant to missing genes |
| Number of input lists | Optimal with 5–20 lists | Scales well with more than 20 lists |
| Noise resistance | Moderate | High |
| Execution time | Higher due to permutation testing | Lower, computationally efficient |
| Quantitative interpretation | Considers expression magnitude indirectly | Considers only rank order |
| Recommended applications | Datasets with consistent platforms and full coverage | Integration of diverse and incomplete datasets |
Table 1: Comparison of meta-analysis methods
3 RankProd Workflow
3.1 Input Methods
MetaRank allows users to input ranked gene lists for Rank Product analysis in three flexible ways:
Upload Files:
When the “Upload Files” mode is selected, users can upload one or more files in .TXT, .TSV, or .CSV format. Each file represents a ranked gene list from a separate study. The system expects each file to contain at least two columns: a gene identifier (e.g., TP53) and a numeric score representing the expression level or any other ranking criterion. The following guidelines are recommended:
- Supported encodings: UTF-8.
- Supported delimiters: comma (,) or tab (\t) (automatically detected).
- Do not include headers: remove any column names or row names.
- Scores are mandatory: if no scores are detected, an error is displayed.
- Complex gene entries (e.g., HBA2///HBA1) are parsed, and only the first gene (HBA2) is retained.
- NA values, blank lines, and duplicate genes are cleaned automatically.
Clicking the ℹ️ icon opens a modal window showing the expected file structure and format. There is no strict limit to the number of gene lists that can be uploaded, as it depends on the size of each list. For example, when lists contain approximately 20,000 genes, up to 12 have been successfully processed. In contrast, for smaller lists (ranging from 100 to 500 genes), the system has handled up to 50 lists without issue.
Many interface elements include contextual tooltips activated by hovering. These tooltips explain each input option, accepted formats, and internal validation steps. For example, hovering over the Use Example Data toggle reveals the origin of these datasets, while hovering over the text input area shows how to format pasted genes properly.
Paste Genes:
When the “Paste Genes” mode is enabled, users can manually paste ranked gene lists into a large text area. This mode supports both .CSV and .TSV formatting, selectable from a dropdown. Each list must be separated by the string ###, and within each block, one gene per line is expected. The score must follow the gene, separated by a tab or comma. Unlike the file upload mode, the pasted input must include a header in each block with the exact column names Gene and Stat.data:

| Example format (TSV) | Technical format (TSV) |
|---|---|
| Gene Stat.data | Gene\tStat.data\n |
| TP53 0.95 | TP53\t0.95\n |
| BRCA1 0.91 | BRCA1\t0.91\n |
| EGFR 0.85 | EGFR\t0.85\n |
| ### | ### |
| Gene Stat.data | Gene\tStat.data\n |
| BRCA1 0.95 | BRCA1\t0.95\n |
| EGFR 0.91 | EGFR\t0.91\n |
| TP53 0.85 | TP53\t0.85 |

| Example format (CSV) | Technical format (CSV) |
|---|---|
| Gene,Stat.data | Gene,Stat.data\n |
| MYC,0.93 | MYC,0.93\n |
| CDK2,0.88 | CDK2,0.88\n |
| FOXO1,0.80 | FOXO1,0.80\n |
| ### | ### |
| Gene,Stat.data | Gene,Stat.data\n |
| CDK2,0.93 | CDK2,0.93\n |
| FOXO1,0.88 | FOXO1,0.88\n |
| MYC,0.80 | MYC,0.80 |

Similar to file upload, pasted inputs are automatically cleaned of duplicates and malformed entries. The placeholder text in the input box provides a working example for guidance.
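To make the paste format concrete, the following Python sketch parses a pasted block structure of this kind. It is not MetaRank's parser: `parse_pasted_lists` is a hypothetical helper, and keeping the first occurrence of a duplicated gene is an assumption, since the text only says that duplicates are cleaned.

```python
def parse_pasted_lists(text, sep="\t"):
    """Split pasted text on '###' and return one {gene: score} dict per block."""
    lists = []
    for block in text.split("###"):
        rows = [line.strip() for line in block.splitlines() if line.strip()]
        if not rows:
            continue  # empty block, e.g. a trailing separator
        header = [h.strip() for h in rows[0].split(sep)]
        if header != ["Gene", "Stat.data"]:
            raise ValueError(f"unexpected header: {rows[0]!r}")
        scores = {}
        for row in rows[1:]:
            fields = row.split(sep)
            if len(fields) != 2:
                continue  # malformed line: skipped
            gene = fields[0].split("///")[0].strip()  # keep first of 'A///B'
            try:
                score = float(fields[1])
            except ValueError:
                continue  # non-numeric score: skipped
            if gene and gene not in scores:  # assumption: first occurrence wins
                scores[gene] = score
        lists.append(scores)
    return lists


pasted = (
    "Gene\tStat.data\nTP53\t0.95\nBRCA1\t0.91\nHBA2///HBA1\t0.85\n"
    "###\n"
    "Gene\tStat.data\nBRCA1\t0.95\nEGFR\t0.91\nTP53\t0.85"
)
parsed = parse_pasted_lists(pasted)
print(parsed)
```

For comma-separated input the same function would be called with `sep=","`.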
Figure 2: a) MetaRank input methods for Rank Product and b) details shown in the ℹ️ info window regarding the required file structure.
Use Example Data:
Enabling the “Use Example Data” switch loads four datasets for demonstration purposes. These examples simulate real analysis scenarios with pre-ranked gene lists across multiple studies, allowing users to explore the workflow without providing their own data.

Table 2 shows the four gene lists used in our example analysis. These lists come from four independent lung cancer studies with the following identifiers: GSE10072, GSE19188, GSE63459, and GSE75037. Each pair of columns corresponds to one list; the lists contain 22283, 54675, 24526, and 48803 gene identifiers (SYMBOL), respectively, including duplicate or missing entries. The second column of each pair holds the statistical value associated with each gene (in this case, the pvalue). This arrangement allows direct comparison of the size and composition of the lists across studies, highlighting the diverse scope of each dataset prior to the meta-analysis. To study these data in depth, users can download the datasets and view their distribution in the UpSet plot (Section 5.1.1) and heatmap (Section 5.1.2).
Table 2: Sample data content.
3.2 Parameters
Once the gene lists are loaded, six configuration parameters become available to customize the RankProd analysis. These options provide full control over how genes are filtered and ranked. You can fine-tune aspects such as ranking direction, handling of missing values, penalization of genes with low recurrence, and the minimum number of lists a gene must appear in to be considered. This flexibility ensures the meta-analysis is aligned with your experimental design and data quality.
Figure 3: The parameter panel offered by MetaRank to adjust the meta-analysis.
3.2.1 Rank-based Method
A rank-based meta-analysis combines gene rankings across multiple studies or conditions instead of directly comparing raw values. This approach is especially useful when datasets are heterogeneous or measured on different scales, allowing robust integration based on gene order rather than absolute expression.
There are two available modes:
- Basic: Uses RankProd::RP. It assumes that each gene list comes from a unique origin (i.e., no shared batches). This is ideal when the true origin of your data is unknown or when you prefer not to group the lists explicitly. Suitable for datasets with an unknown or homogeneous background (e.g., mixed public data without batch labels).
- Advanced: Uses RankProd::RP.advance. This mode allows specifying a vector of origins (or batches) for each list via the Origin field (explained in Section 3.2.2). Recommended when your gene lists come from distinct experimental setups, platforms, conditions, or time points. It adjusts the ranking by grouping lists with the same origin, improving robustness in multi-batch scenarios.
3.2.2 Origin (Advanced only)
The Origin field is required when using the Advanced mode. It should be a comma-separated vector of integers (e.g., 1,1,2,2), where each number indicates the batch or origin of the corresponding input file.
This field:
- Must match the number of input gene lists.
- Allows grouping lists from the same source.
- Requires each batch to have replicates, i.e., at least two datasets from the same source.
- Is validated with custom error messages if the format is incorrect or inconsistent.
- Is accompanied by an ℹ️ info button with a usage example for user guidance.
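The rules listed for the Origin field can be expressed as a small validation routine. The sketch below is only an illustration under the rules stated above; `parse_origin` and its error messages are hypothetical and do not reproduce MetaRank's own validation or wording.

```python
from collections import Counter


def parse_origin(origin_text, n_lists):
    """Validate a comma-separated Origin string such as '1,1,2,2'."""
    parts = [p.strip() for p in origin_text.split(",")]
    if not all(p.isdigit() for p in parts):
        raise ValueError("Origin must be comma-separated integers, e.g. 1,1,2,2")
    origin = [int(p) for p in parts]
    if len(origin) != n_lists:
        raise ValueError(
            f"Origin has {len(origin)} entries but {n_lists} lists were loaded"
        )
    # Every batch needs at least two lists (replicates).
    singletons = sorted(b for b, c in Counter(origin).items() if c < 2)
    if singletons:
        raise ValueError(f"Batches without replicates: {singletons}")
    return origin


print(parse_origin("1,1,2,2", n_lists=4))
```

With four lists loaded, `"1,1,2,2"` passes, whereas `"1,1,2"` (wrong length) and `"1,2,2,2"` (batch 1 has a single list) are rejected.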
Figure 4: Details displayed in the information window ℹ️ about the structure of the Origin variable
3.2.3 Minimum Number of Datasets
This slider sets the minimum number of input lists in which a gene must appear to be included in the analysis. Genes present in fewer lists are excluded before the ranking process. For example, if you set it to 4 and there are 5 input lists, only genes appearing in at least 4 lists (that is, in 4 or 5 lists) are considered.
This filter helps reduce statistical noise caused by infrequent genes that may distort the consensus ranking. Genes that appear only once are always shown separately in the “excluded genes” table, since they do not allow robust comparison and may bias the analysis if included.
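As a rough illustration of this filter, the sketch below splits genes into kept and excluded groups from their appearance counts. The function name and the example counts are hypothetical; the only behaviour taken from the text is that genes below the threshold, and genes seen in a single list, are excluded.

```python
def split_by_min_lists(file_counts, min_lists):
    """Split genes into (kept, excluded) by the number of lists they appear in."""
    # Genes that appear only once are excluded regardless of the slider value.
    threshold = max(min_lists, 2)
    kept = sorted(g for g, c in file_counts.items() if c >= threshold)
    excluded = sorted(g for g, c in file_counts.items() if c < threshold)
    return kept, excluded


# Hypothetical appearance counts with 5 input lists and the slider set to 4.
file_counts = {"TP53": 5, "EGFR": 4, "MYC": 3, "CDK2": 1}
kept, excluded = split_by_min_lists(file_counts, min_lists=4)
print(kept, excluded)
```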
3.2.4 Ranking Direction
This option indicates whether lower or higher values should be considered better rankings. This depends on the type of metric:
- Ascending: lower values are better (e.g., pvalues).
- Descending: higher values are better (e.g., logFC, z-scores, relevance scores).
It is important to choose the correct direction to ensure proper interpretation of the results.
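The effect of the direction setting can be seen in a minimal sketch that converts scores to rank positions. This is illustrative only: `scores_to_ranks` is a hypothetical helper, the values are made up, and ties are simply broken by input order, which may differ from what the app does.

```python
def scores_to_ranks(scores, direction="ascending"):
    """Map each gene to its rank position (1 = best) for the chosen direction."""
    if direction not in ("ascending", "descending"):
        raise ValueError("direction must be 'ascending' or 'descending'")
    ordered = sorted(scores, key=scores.get, reverse=(direction == "descending"))
    return {gene: position for position, gene in enumerate(ordered, start=1)}


pvalues = {"TP53": 0.001, "EGFR": 0.04, "MYC": 0.2}
logfc = {"TP53": 2.5, "EGFR": -1.0, "MYC": 0.3}

print(scores_to_ranks(pvalues, "ascending"))   # smallest pvalue ranks first
print(scores_to_ranks(logfc, "descending"))    # largest logFC ranks first
```

Choosing the wrong direction would invert the ranking, for example placing the largest pvalue first.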
3.2.5 NA Management
This option determines how to handle missing values (genes not present in some of the lists):
| Option | Description |
|---|---|
| Impute NA | Replaces NA with the median rank of the list. Allows applying an extra penalty based on the number of appearances. Useful when preserving all genes and reducing the impact of missing values, for example in exploratory analyses. |
| Ignore NA | Uses only the available values, omitting NAs. Also allows extra penalization based on the number of appearances. Useful when preserving all genes and reducing the impact of missing values, for example in exploratory analyses. |
| Penalize NA | Assigns the worst possible rank to missing values, depending on whether the direction is ascending or descending. Recommended when missing values should be heavily penalized to increase robustness. |
Table 3: Missing values management options.
If you apply a Minimum Number of Datasets filter that requires genes to appear in all lists, there will be no missing values and this setting will have no effect. Note that the impact of missing value management depends on how strict or tolerant the user wants the analysis to be. More relaxed settings retain more genes but may introduce noise, while stricter settings increase reliability but may discard potentially relevant genes.
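The three options in Table 3 can be sketched on a single gene's rank vector. This is a simplified illustration, not MetaRank's code: it works on rank positions only, `fill_missing_ranks` is a hypothetical helper, and taking the list length as the "worst possible rank" is an assumption.

```python
from statistics import median


def fill_missing_ranks(ranks, list_sizes, mode):
    """Handle missing ranks (None) for one gene across several lists.

    ranks      -- rank of the gene in each list, or None if absent
    list_sizes -- number of genes in each list
    mode       -- 'impute', 'ignore' or 'penalize'
    """
    if mode not in ("impute", "ignore", "penalize"):
        raise ValueError(f"unknown mode: {mode!r}")
    filled = []
    for rank, size in zip(ranks, list_sizes):
        if rank is not None:
            filled.append(rank)
        elif mode == "impute":
            filled.append(median(range(1, size + 1)))  # median rank of the list
        elif mode == "penalize":
            filled.append(size)  # assumption: worst rank = list length
        # mode == 'ignore': the missing entry is simply dropped
    return filled


ranks = [3, None, 10]          # gene absent from the second list
sizes = [100, 100, 100]
for mode in ("impute", "ignore", "penalize"):
    print(mode, fill_missing_ranks(ranks, sizes, mode))
```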
3.2.6 Extra Penalization (Impute & Ignore only)
When this option is enabled, an additional penalty is applied to each gene depending on the number of lists in which it appears (calculated before the analysis, but applied after it). The fewer times a gene appears, the worse its adjusted ranking will be, even if it initially ranked well. An adjusted rank is calculated by adding a penalty proportional to the number of lists where the gene is missing.
Conceptual formula:
AdjustedRank = Rank + ((TotalLists - Count) * (MaxRank / TotalLists))
Where:
- Rank: the original consensus rank of the gene.
- TotalLists: the total number of gene lists loaded.
- Count: the number of lists in which the gene appears.
- MaxRank: the worst (highest) rank in the current ranking.
This adjustment is especially useful when using the Impute or Ignore NA options, to ensure that genes with limited support across datasets are penalized accordingly and do not dominate the top of the consensus ranking. This helps to prioritize genes that are consistently present and reduce the impact of rare, potentially spurious genes.
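A short worked example of the conceptual formula, written as a Python sketch. The function mirrors the formula term by term; the input numbers are invented for illustration.

```python
def adjusted_rank(rank, count, total_lists, max_rank):
    """AdjustedRank = Rank + ((TotalLists - Count) * (MaxRank / TotalLists))."""
    return rank + (total_lists - count) * (max_rank / total_lists)


# Hypothetical ranking with 4 lists and a worst rank of 1000.
well_supported = adjusted_rank(rank=20, count=4, total_lists=4, max_rank=1000)
rarely_seen = adjusted_rank(rank=5, count=2, total_lists=4, max_rank=1000)

print(well_supported)  # present in all lists: no penalty
print(rarely_seen)     # missing from 2 of 4 lists: penalty of 2 * 250
```

Here the gene ranked 5th but present in only two of four lists moves to an adjusted rank of 505, behind the gene ranked 20th that appears in every list.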
3.3 Outputs
3.3.1 Results Table (RankProd)
The main output of the RankProd analysis is a table with the following columns and their meanings:
| Column Name | Description |
|---|---|
| GeneID | Unique gene identifier, which can be a HUGO symbol (e.g., TP53), an Entrez ID (e.g., 7157), or an Ensembl ID (e.g., ENSG00000141510), depending on input. |
| Rank | Consensus ranking of the gene across all input lists; lower values indicate higher overall relevance or consistency among the lists. |
| FileCount | Number of input gene lists in which this gene appears; a higher count suggests greater consistency across datasets. |
| FileNames | Names of the input files where the gene was found, separated by spaces; useful for identifying the sources supporting the gene’s relevance. |
| GenePositions | Rank positions of the gene in each individual input list; provides insight into the gene’s performance across different datasets. |
| RP_stat | Rank Product statistic calculated to assess the significance of the gene’s ranking across multiple lists; lower values suggest higher significance. |
| PFP | Estimated Proportion of False Positives, analogous to False Discovery Rate (FDR); lower values indicate more reliable findings. |
| pvalue | Raw p-value from the meta-analysis, indicating the probability of observing the gene’s ranking by chance; lower values suggest higher significance. |
| p.adjust | Adjusted p-value accounting for multiple hypothesis testing using the Benjamini-Hochberg method; helps control the FDR. |
Table 4: The names of the columns and their meanings in the main table generated by RankProd in MetaRank.
The table can be downloaded in .TSV and .CSV formats. Each column has a tooltip (question mark icon) that shows this information when hovered over. The table is interactive: columns can be filtered by value ranges or keywords, and their order can be customized.
If data filtering is applied, either by value or by selecting specific columns (see Section 5.3.1), the downloaded file will reflect only the currently displayed data. This includes both the filtered rows and the visible columns selected in the user interface.
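Table 4 describes the p.adjust column as a Benjamini-Hochberg correction. For readers unfamiliar with it, the following Python sketch shows the standard step-up procedure on a handful of made-up pvalues; it is a generic illustration and is not taken from MetaRank.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for position in range(m - 1, -1, -1):
        i = order[position]
        running_min = min(running_min, pvalues[i] * m / (position + 1))
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
print(adj)
```

Each raw value is scaled by the number of tests divided by its rank among the sorted pvalues, and the results are capped so that adjusted values never decrease as raw values increase.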
3.3.2 Excluded Genes
A secondary table is generated and accessible via the eye icon button, also downloadable as .TSV. It always contains genes excluded by the Minimum Number of Datasets filter and those appearing only once. If, for example, a filter of 4/4 is applied, genes appearing in 1, 2, or 3 lists are moved to this excluded table, while only genes appearing in all 4 lists remain in the main table. This table allows tracking of excluded genes and understanding of filtering effects.
| Column Name | Description |
|---|---|
| GeneID | Unique identifier of the excluded gene, in the same format as the input. |
| FileCount | Number of input gene lists in which this gene appears. |
| FileNames | Names of the input files where the gene was found. |
Table 5: The names of the columns and their meanings in the excluded table generated by RankProd in MetaRank.
4 RobustRankAggreg Workflow
4.1 Input Methods
Upload Files: When the “Upload Files” mode is enabled, it is possible to select one or more text files (.TXT) via the file upload control. The system recognizes each file as a list of genes (one identifier per line, without a header), automatically removes duplicates and missing values, and correctly handles both Unix (\n) and Windows (\r\n) line endings. If more than one gene is provided on a single line separated by delimiters (e.g., BRCA1///BRCA2), only the first entry (BRCA1) is retained.

Clicking the ℹ️ icon opens a modal showing a sample file structure, and example datasets can be downloaded for in-depth study and reference. There is no strict limit to the number of gene lists that can be uploaded, as it depends on the size of each list. For example, when lists contain approximately 20,000 genes, up to 12 have been successfully processed. In contrast, for smaller lists (ranging from 100 to 500 genes), the system has handled up to 50 lists without issue.

Paste Genes: When the “Paste Genes” mode is enabled, gene lists can be entered directly into a text area. Each list is delimited by ###, and within each section the system expects one gene per line, with no header row.

| Example format | Technical format |
|---|---|
| TP53 | TP53\n |
| BRCA1 | BRCA1\n |
| EGFR | EGFR\n |
| ### | ###\n |
| BRCA1 | BRCA1\n |
| EGFR | EGFR\n |
| TP53 | TP53 |

Duplicate entries and blank lines are cleaned up automatically, and if a line contains multiple gene identifiers (e.g., BRCA1///BRCA2), only the first is used. The placeholder text illustrates this formatting.
Figure 5: a) MetaRank input methods for RobustRankAggreg and b) details shown in the ℹ️ info window regarding the required file structure.
Use Example Data: Enabling the “Use Example Data” switch loads predefined files that represent various analysis scenarios, allowing users to explore the workflow without providing their own data.
Table 6 shows the four gene lists used in our example analysis. These lists come from four independent lung cancer studies with the following identifiers: GSE10072, GSE19188, GSE63459, and GSE75037. Each column corresponds to one list; the lists contain 19417, 21752, 17509, and 13099 gene identifiers (SYMBOL), respectively, including duplicate or missing entries. This arrangement allows direct comparison of the size and composition of the lists across studies, highlighting the diverse scope of each dataset prior to the meta-analysis.
Table 6: The content of the 4 example files together in one table.
4.2 Parameters
Once the gene lists are loaded, three configuration parameters become available to customize the RobustRankAggreg analysis. These options provide some control over how genes are filtered and ranked. You can fine-tune aspects such as selecting the aggregation method, handling of missing values, or even the minimum number of lists a gene must appear in to be considered. This flexibility ensures the meta-analysis is aligned with your experimental design and data quality.
Figure 6: The parameter panel offered by MetaRank to adjust the RobustRankAggreg meta-analysis.
4.2.1 Aggregation Method
The RobustRankAggreg package offers five aggregation methods to combine rankings across multiple gene lists, ranging from simple statistical approaches to the more sophisticated probabilistic scoring native to the package:
- RRA: Uses a probabilistic model to assign p-values to ranks and keeps the minimum of these p-values as the gene's score. It evaluates how surprising a gene’s rankings are across the datasets by comparing them with the beta distribution expected for order statistics of uniformly distributed ranks.
- Median: Takes the median rank of each gene across all lists. It is robust to outliers and provides a central tendency measure.
- Stuart: A method based on order statistics. It combines ranks using a meta-analysis approach, particularly suitable for independent rankings.
- Geometric Mean: Computes the geometric mean of ranks across lists, giving more weight to consistently low ranks.
- Arithmetic Mean: Averages the rank values directly. This method is sensitive to outliers but intuitive and easy to interpret.
All of these methods rely on the position of genes in the individual rankings to compute a consensus order. The RRA method is unique in that it transforms rankings into p-values and evaluates their statistical significance, accounting for both the number of lists and the positions within each.
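To show how the simpler aggregation options can disagree, here is a small Python sketch that applies the median, geometric mean, and arithmetic mean to the same made-up rank table. It works on raw rank positions for readability and is not the RobustRankAggreg implementation; the dictionary keys are descriptive labels, not the package's argument values.

```python
from math import prod
from statistics import mean, median

# Descriptive labels for three of the simple aggregation options.
AGGREGATORS = {
    "median": median,
    "geometric mean": lambda ranks: prod(ranks) ** (1 / len(ranks)),
    "arithmetic mean": mean,
}


def aggregate(rank_table, method):
    """Order genes by the chosen summary of their ranks (lower = better)."""
    summarise = AGGREGATORS[method]
    scores = {gene: summarise(ranks) for gene, ranks in rank_table.items()}
    return sorted(scores, key=scores.get)


# Hypothetical ranks: GENE_A is excellent in two lists and poor in one.
rank_table = {
    "GENE_A": [1, 2, 95],
    "GENE_B": [10, 12, 14],
    "GENE_C": [30, 31, 32],
}

for method in AGGREGATORS:
    print(method, aggregate(rank_table, method))
```

With these numbers the median and geometric mean keep GENE_A on top, while the arithmetic mean pushes it to last place because of the single poor rank, which reflects the outlier sensitivity noted in the list above.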
4.2.2 Minimum Number of Datasets
This slider sets the minimum number of input lists in which a gene must appear to be included in the analysis. Genes present in fewer lists are excluded before the ranking process. For example, if you set it to 4 and there are 5 input lists, only genes appearing in at least 4 lists are considered.
This filter helps reduce statistical noise caused by infrequent genes that may distort the consensus ranking. Genes that appear only once are always shown separately in the “excluded genes” table, since they do not allow robust comparison and may bias the analysis if included.
4.2.3 NA Management
This option controls how to handle missing values (i.e., when a gene does not appear in a list). Two strategies are provided:
- Ignore NA: Exclude missing values from the analysis.
- Penalize NA: Assign the worst rank to missing entries.
In this context, additional penalization is not required beyond what the algorithm already incorporates. The score, also known as rho, is a significance measure used in RobustRankAggreg to reflect how strongly a gene is supported across the rankings. It is based on the minimum p-value method:
- Each gene’s position in a list is converted to a probability. For example, if a gene ranks 5th in a list of 1000, we calculate the probability of randomly selecting a gene ranked 5th or better.
- The lowest (best) of these probabilities across all lists is taken.
- The final score is calculated using a beta distribution (the null distribution of order statistics of uniformly distributed ranks), estimating the likelihood of observing such a good ranking by chance, given how many lists exist and in how many the gene appears.
If a gene is absent from some lists, the method does not assign an artificially bad rank. Instead, the score inherently adjusts for the fact that a gene appeared in fewer lists. This naturally penalizes low-frequency genes unless they show extremely strong evidence in the lists they do appear in. A low score means the gene’s strong ranks are unlikely to be due to chance and that it is consistently important across studies.
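The scoring idea can be sketched in Python using the order-statistic formulation described for the RRA method: the gene's normalized ranks are sorted, each k-th best rank is compared with what would be expected from uniformly random ranks, and the smallest resulting p-value is multiplied by the number of lists and capped at 1. The simplified steps above correspond to the first term (k = 1). This is a hedged illustration with hypothetical function names, not the package's code, and it assumes the gene has a rank in every list passed in.

```python
from math import comb


def beta_pvalue(x, k, n):
    """P(the k-th smallest of n uniform(0, 1) values is <= x).

    Equals the Beta(k, n - k + 1) CDF at x, written as a binomial tail sum.
    """
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(k, n + 1))


def rho_score(normalized_ranks):
    """Minimum order-statistic p-value, multiplied by n and capped at 1."""
    r = sorted(normalized_ranks)
    n = len(r)
    best = min(beta_pvalue(x, k, n) for k, x in enumerate(r, start=1))
    return min(best * n, 1.0)


# Normalized rank = position / list length, e.g. 5th of 1000 -> 0.005.
consistent_gene = rho_score([0.005, 0.01, 0.5])
unremarkable_gene = rho_score([0.5, 0.6, 0.7])
print(consistent_gene, unremarkable_gene)
```

A gene near the top of two out of three lists receives a small score, whereas a gene with mid-range positions everywhere receives the maximum score of 1.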
4.3 Outputs
4.3.1 Results Table (RRA)
The output table from the RobustRankAggreg (RRA) workflow differs slightly from the one used in RankProd. The available columns are:
| Column Name | Description |
|---|---|
| GeneID | Unique gene identifier, which can be a HUGO symbol (e.g., TP53), an Entrez ID (e.g., 7157), or an Ensembl ID (e.g., ENSG00000141510), depending on input. |
| Rank | Consensus ranking of the gene across all input lists; lower values indicate higher overall relevance or consistency among the lists. |
| Score | Also called rho, this is the probabilistic score assigned by RRA, reflecting the significance of the observed ranks (lower values indicate stronger evidence). |
| p.adjust | Adjusted p-value (multiple testing correction) for the Score using the Benjamini-Hochberg method; helps control the FDR. |
| FileCount | Number of input gene lists in which this gene appears; a higher count suggests greater consistency across datasets. |
| FileNames | Names of the input files where the gene was found, separated by spaces; useful for identifying the sources supporting the gene’s relevance. |
| GenePositions | Rank positions of the gene in each individual input list; provides insight into the gene’s performance across different datasets. |
Table 7: The names of the columns and their meanings in the main table generated by RobustRankAggreg in MetaRank.
The table can be downloaded in TSV and CSV formats. Each column has a tooltip (question mark icon) that shows information when hovered over. The table is interactive: columns can be filtered by intervals or keywords and reordered.
4.3.2 Excluded Genes
A secondary table is generated and accessible via the eye icon button, also downloadable as TSV. It always contains genes excluded by the Minimum Number of Datasets filter and those appearing only once. If, for example, a filter of 4/4 is applied, genes appearing in 1, 2, or 3 lists are moved to this excluded table, while only genes appearing in all 4 lists remain in the main table.
| Column Name | Description |
|---|---|
| GeneID | Unique identifier of the excluded gene, in the same format as the input. |
| FileCount | Number of input gene lists in which this gene appears. |
| FileNames | Names of the input files where the gene was found. |
Table 8: The names of the columns and their meanings in the excluded table generated by RobustRankAggreg in MetaRank.
This table allows tracking of excluded genes and understanding of filtering effects.
6 Background Pipeline
MetaRank performs consensus-based gene ranking using two complementary strategies: RankProd (RP) and RobustRankAggreg (RRA). The workflow processes input gene lists through a series of well-defined steps to produce both tabular and graphical enrichment results:
6.1 Analysis Method Selection
Users can select one of the following meta-ranking algorithms:
- RankProd (RP):
- A non-parametric method for identifying genes that are consistently up- or down-regulated across experiments.
- Relies on ranking statistics typically derived from fold-change values.
- Its functionality has been adapted to improve robustness in consensus ranking, including enhanced handling of NA values during computation.
- Robust Rank Aggregation (RRA):
- Detects genes that consistently appear at the top of ranked lists more often than expected by chance.
- Operates solely on the order of ranks, without requiring fold-change values or p-values.
- Supports multiple ranking strategies, from simple methods like the median rank to more sophisticated ones such as the original RRA algorithm.
6.2 Data Ingestion and Preprocessing
Users can provide gene data by uploading multiple files (.CSV, .TSV, .TXT) or pasting lists directly into a text box (supports multiline input). Additionally, it is possible to work exclusively with the example datasets provided by the app, which are also available for download.
The required input format depends on the selected analysis package:
- RankProd:
- Accepts .CSV (comma-separated) or .TSV (tab-separated) files.
- Each file or list must contain two columns:
- One with gene identifiers (e.g., Gene, EntrezID, or Ensembl).
- One numeric column representing a ranking metric (e.g., pvalue, logFC, etc.).
- Uploaded files must NOT contain headers. In contrast, in paste mode each list must include headers, specifically Gene and Stat.data.
- In paste mode, multiple lists must be separated using the delimiter ###.
- Make sure to check the info (ℹ) button for a detailed explanation of the correct input format.
- RRA:
- Accepts .TXT files or pasted plain text.
- Each file or list must consist of a single column containing gene identifiers (e.g., Gene, EntrezID, or Ensembl), listed one per line in descending order of significance.
- In paste mode, multiple lists must be separated using the delimiter ###.
- Make sure to check the info (ℹ) button for a detailed explanation of the correct input format.
- Validation:
- The app validates uploaded content to ensure consistent formatting, presence of required columns, and proper delimiters.
- Automatic preprocessing includes:
- Removing blank rows.
- Trimming leading/trailing whitespace.
- Attempting to detect the gene identifier format (SYMBOL, ENSEMBL, or ENTREZID).
- If invalid input is detected, the app shows informative modals and provides example formats to guide the user.
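The preprocessing steps above (splitting on ###, trimming whitespace, removing blank rows) can be sketched for RRA-style pasted input as follows. This is a hedged Python illustration, not MetaRank's R code. `parse_pasted_lists` is a hypothetical helper, and silently ignoring empty blocks is an assumption; the app may instead report them as an error.

```python
def parse_pasted_lists(text: str, delimiter: str = "###") -> list:
    """Split pasted text into lists of gene IDs, one list per delimiter block."""
    lists = [[]]
    for raw_line in text.splitlines():
        line = raw_line.strip()      # trim leading/trailing whitespace
        if not line:                 # drop blank rows
            continue
        if line == delimiter:        # a line holding only ### starts a new list
            lists.append([])
        else:
            lists[-1].append(line)
    return [genes for genes in lists if genes]  # assumption: skip empty blocks


pasted = """
TP53
  BRCA1

###
BRCA1
EGFR
"""
print(parse_pasted_lists(pasted))
# prints [['TP53', 'BRCA1'], ['BRCA1', 'EGFR']]
```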
6.3 Gene Appearance Counting and Filtering
For every gene across all input lists, the system:
- Counts the total number of appearances (i.e., in how many lists the gene is found),
- Records the names of the lists (or files) in which the gene appears,
- Stores the rank positions of the gene in each list where it is present. For example, if a gene appears at position 45 in list 1, position 98053 in list 2, and position 1 in list 3, the position vector would be: 45, 98053, 1.
This information is compiled into a detailed appearance table for each gene, enabling complete traceability and data auditing.
A user-defined minimum appearance threshold is applied:
- Genes must appear in a minimum number of input lists to be included in the final meta-analysis.
- This filtering step removes low-frequency or list-specific genes, which helps reduce background noise and increases the robustness of the consensus ranking.
- The threshold is configurable by the user to balance inclusiveness and specificity.
Outputs:
- Included genes: Genes that meet or exceed the appearance threshold. These are used in the meta-ranking process and included in the final results.
- Excluded genes: Genes that do not meet the threshold. These are completely excluded from both the consensus ranking and any enrichment analysis, which also helps reduce computational time. They are still accessible for review and optional download.
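The counting and filtering logic of this step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the app's R implementation. `appearance_table` and `split_by_threshold` are hypothetical names, the file names and genes are toy data, and each gene is assumed to occur at most once per list.

```python
def appearance_table(lists: dict) -> dict:
    """For every gene, record count, source list names, and 1-based positions."""
    table = {}
    for name, genes in lists.items():
        for position, gene in enumerate(genes, start=1):
            row = table.setdefault(gene, {"count": 0, "files": [], "positions": []})
            row["count"] += 1
            row["files"].append(name)
            row["positions"].append(position)
    return table


def split_by_threshold(table: dict, min_lists: int):
    """Separate genes that reach the minimum appearance threshold from the rest."""
    included = {g: r for g, r in table.items() if r["count"] >= min_lists}
    excluded = {g: r for g, r in table.items() if r["count"] < min_lists}
    return included, excluded


lists = {
    "list1.txt": ["TP53", "BRCA1", "EGFR"],
    "list2.txt": ["BRCA1", "TP53"],
    "list3.txt": ["TP53", "MYC"],
}
table = appearance_table(lists)
included, excluded = split_by_threshold(table, min_lists=2)
print(sorted(included))            # prints ['BRCA1', 'TP53']
print(sorted(excluded))            # prints ['EGFR', 'MYC']
print(table["TP53"]["positions"])  # prints [1, 2, 1]
```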
6.4 Consensus Ranking Computation
The selected meta-ranking method is applied to the filtered gene lists:
- RankProd:
- Computes rankings for upregulated and downregulated genes separately.
- Utilizes the RP or RP.advance functions from the RankProd Bioconductor package.
- Includes advanced options such as:
- Graceful handling of missing values (NA).
- Custom directionality settings to rank by high or low values depending on the metric.
- Optional penalization of genes with low appearance frequency to further refine the consensus.
- RRA:
- Aggregates multiple ranked gene lists into a single consensus ranking.
- Uses the aggregateRanks function from the RobustRankAggreg package.
- Performs a permutation-based statistical analysis to calculate:
- P-values, representing the likelihood of observing such high rankings by chance.
- Adjusted p-values, corrected for multiple testing using standard methods (Benjamini-Hochberg).
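As a rough illustration of the statistic behind the RankProd branch, the rank product of a gene is the geometric mean of its ranks across lists. The Python sketch below is not the RankProd package's implementation. `rank_product` is a hypothetical helper, the ranks are toy data, skipping missing values is only one possible NA treatment, and significance estimation is omitted entirely.

```python
from math import prod
from typing import Optional, Sequence


def rank_product(ranks: Sequence[Optional[int]]) -> float:
    """Geometric mean of the ranks a gene received, skipping missing lists."""
    present = [r for r in ranks if r is not None]
    if not present:
        raise ValueError("gene has no rank in any list")
    return prod(present) ** (1.0 / len(present))


# Toy data: rank of each gene in three lists (None = gene absent from that list).
ranks_by_gene = {
    "TP53": [1, 2, 1],
    "BRCA1": [2, 1, None],
    "EGFR": [3, 3, 2],
}

# A smaller rank product means the gene is more consistently near the top.
consensus = sorted(ranks_by_gene, key=lambda g: rank_product(ranks_by_gene[g]))
print(consensus)  # prints ['TP53', 'BRCA1', 'EGFR']
```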
6.5 Final table creation
After computing the consensus ranking, a final result table is generated by merging the ranking outputs with the detailed gene appearance data.
- For included genes, the final table contains:
- Consensus ranking metrics (e.g., rank, p-value, score depending on method).
- The number of appearances across all input lists.
- A list of input files in which the gene appears.
- A position vector, indicating the gene’s position in each list where it is present.
- This integration enables full traceability and biological interpretability of the ranking.
- For excluded genes, a separate table is created containing:
- The GeneID,
- The number of appearances,
- The names of the input files in which the gene was detected.
- This simplified table is made available for inspection and optional download, but these genes are not used in any part of the ranking or enrichment process.
This two-table approach ensures a transparent analysis pipeline while maintaining performance and interpretability.
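The merge described in this step amounts to a join on the gene identifier. A minimal Python sketch follows, assuming consensus scores where lower is better. `build_final_table` and `to_tsv` are hypothetical helpers. The GeneID, FileCount, and FileNames columns mirror the excluded-gene table, while Rank, Score, and Positions are illustrative names rather than the app's exact headers.

```python
import csv
import io


def build_final_table(scores: dict, appearances: dict) -> list:
    """Join consensus scores with appearance data, ordered by ascending score."""
    rows = []
    ordered = sorted(scores.items(), key=lambda item: item[1])
    for rank, (gene, score) in enumerate(ordered, start=1):
        info = appearances[gene]
        rows.append({
            "GeneID": gene,
            "Rank": rank,
            "Score": score,
            "FileCount": len(info["files"]),
            "FileNames": ";".join(info["files"]),
            "Positions": ",".join(str(p) for p in info["positions"]),
        })
    return rows


def to_tsv(rows: list) -> str:
    """Serialize the table as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]),
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Toy inputs standing in for the outputs of the two previous steps.
scores = {"TP53": 0.002, "BRCA1": 0.04}
appearances = {
    "TP53": {"files": ["list1.txt", "list2.txt"], "positions": [1, 2]},
    "BRCA1": {"files": ["list1.txt", "list2.txt"], "positions": [2, 1]},
}
rows = build_final_table(scores, appearances)
print(to_tsv(rows))
```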
6.6 Annotation Retrieval (Optional)
- A dedicated script located at database_annotations/get_annotations.R is used to generate local annotation files for Gene Ontology (GO), KEGG, and Reactome. These files include TERM2GENE and TERM2NAME mappings required for enrichment analysis.
- Instead of relying on online-access functions like enrichGO() or enrichKEGG(), the system utilizes the more general enricher() function from the clusterProfiler package. This approach:
- Loads annotations into memory at runtime.
- Significantly improves performance.
- Prevents errors caused by lack of internet connectivity or remote service timeouts.
- All annotation files are stored in the /database_annotations/ directory and are automatically loaded by the app when enrichment is requested.
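Conceptually, TERM2GENE is a two-column table pairing each term with one gene per row, and TERM2NAME pairs each term with its readable description. The Python sketch below shows how such tables could be held in memory. It is an illustration only: the tab-separated layout, the helper names `load_term2gene` and `load_term2name`, and the toy rows are assumptions, not the actual format written by get_annotations.R.

```python
import csv
import io


def load_term2gene(tsv_text: str) -> dict:
    """Map each term ID to the set of genes annotated to it."""
    mapping = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) != 2:          # skip blank or malformed rows
            continue
        term, gene = row
        mapping.setdefault(term, set()).add(gene)
    return mapping


def load_term2name(tsv_text: str) -> dict:
    """Map each term ID to its human-readable description."""
    return {row[0]: row[1]
            for row in csv.reader(io.StringIO(tsv_text), delimiter="\t")
            if len(row) == 2}


# Toy annotation content (hypothetical excerpt).
term2gene_text = "GO:0006915\tTP53\nGO:0006915\tBAX\nGO:0006281\tBRCA1\n"
term2name_text = "GO:0006915\tapoptotic process\nGO:0006281\tDNA repair\n"

term2gene = load_term2gene(term2gene_text)
term2name = load_term2name(term2name_text)
print(term2name["GO:0006915"], sorted(term2gene["GO:0006915"]))
# prints: apoptotic process ['BAX', 'TP53']
```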
6.7 Over-Representation Analysis (Optional)
- Based on user settings, a subset of the top-ranked genes (from 10 up to a maximum of 100) is selected from the consensus list.
- This selected gene set is then used to perform over-representation analysis against the locally loaded annotation databases.
- The enrichment analysis output includes:
- Term ID (e.g., GO:0008150, R-HSA-123456),
- Description of the biological term or pathway,
- Raw p-values, and
- Associated gene sets involved in the enrichment.
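Over-representation analysis is commonly based on the hypergeometric test: given a background of N genes, of which K belong to a term, it asks how likely it is that a selection of n genes contains at least k members of that term. The Python sketch below shows that calculation. It is not clusterProfiler's code; `ora_p_value` is a hypothetical helper, and the background ("universe") is assumed to be supplied explicitly.

```python
from math import comb


def ora_p_value(selected: set, term_genes: set, universe: set) -> float:
    """Upper-tail hypergeometric p-value for the overlap of selected and term genes."""
    N = len(universe)                       # background size
    K = len(term_genes & universe)          # term members in the background
    n = len(selected & universe)            # selected genes in the background
    k = len(selected & term_genes & universe)  # observed overlap
    # P(X >= k): sum the probabilities of every overlap at least as large as k
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return favourable / comb(N, n)


# Toy example: 10 background genes, a 3-gene term, and 3 top-ranked genes.
universe = {f"G{i}" for i in range(1, 11)}
term = {"G1", "G2", "G3"}
top_genes = {"G1", "G2", "G3"}
print(ora_p_value(top_genes, term, universe))  # 1/120, about 0.0083
```

Repeating this test for every term produces the raw p-values listed above; any multiple-testing correction would be applied afterwards.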
6.8 Result Presentation
After the consensus analysis and optional enrichment, results are presented through multiple interactive and downloadable formats:
- Interactive Results Table:
- Displays the final list of ranked genes.
- Features include column visibility toggling, dynamic filtering, and downloadable formats (.csv and .tsv).
- Excluded Gene Table:
- Displays genes filtered out due to low appearance frequency.
- Includes number of appearances and list of files in which each gene was found.
- Downloadable as .TSV only.
- This table is optional and can be toggled on/off for inspection.
- Upset Plot:
- Visualizes intersections between input lists (i.e., which genes are shared across how many lists).
- Fully interactive with customization options.
- Downloadable as .PNG and .JPG.
- Heatmap:
- Shows the relative rank position of each gene across input lists.
- Provides customization options for clustering, color schemes, and font sizes.
- Downloadable as .PNG, .JPG, and interactive .HTML.
- Enrichment Results Table (optional):
- Displays functional terms or biological pathways enriched among the selected genes.
- Includes term ID, description, p-values, and matching genes.
- Can be exported and explored alongside plots.
- Plotting Options for Gene Ranking:
- Choose between dot plot or bar plot representations.
- Customizable settings:
- Number of top-ranked genes to show.
- Color by p-value, rank, or appearance count.
- Axis labels (e.g., Gene Symbol, Rank, Score).
- Text size and color scale.
- Download options include .PNG, .JPG, and interactive .HTML.
Figure 18: MetaRank workflow.
7 Best Practices
To ensure accurate and meaningful results, users are advised to follow these best practices when using the app:
Choose the appropriate method for your data:
- Evaluate your dataset characteristics before selecting the meta-ranking algorithm.
- If your data includes statistical values or fold changes, RankProd may be more suitable.
- For simpler inputs or when only the order of genes matters, RRA offers a robust, non-parametric option.
- Consider the presence of missing data, the number of lists, and the desired level of analytical stringency.
Ensure correct input formatting:
- RRA input files should contain one gene identifier per line (no headers).
- RankProd files must contain two columns: gene ID and statistic (no header).
- Input pasted directly into the app may include headers.
- File format buttons and info modals provide validated examples — users are encouraged to consult them before uploading.
- Use official nomenclature for gene identifiers (SYMBOL, ENTREZID, ENSEMBL) to ensure correct mapping.
Match gene ID type and organism:
- Confirm that the gene identifier type corresponds to the selected organism (e.g., SYMBOL with Homo sapiens).
- Incorrect matching can lead to failed enrichment or unmapped terms.
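Before uploading, a quick check of the identifier format can catch mixed or mismatched ID types. The Python sketch below uses simple pattern heuristics. The regular expressions and the helper names `guess_id_type` and `id_types_in` are assumptions made for illustration and do not reproduce the detection logic inside MetaRank.

```python
import re
from collections import Counter

# Heuristic patterns (assumptions): Ensembl gene IDs such as ENSG00000141510 or
# ENSMUSG00000059552, and purely numeric Entrez IDs. Anything else counts as SYMBOL.
ENSEMBL = re.compile(r"^ENS[A-Z]*G\d{11}(\.\d+)?$")
ENTREZ = re.compile(r"^\d+$")


def guess_id_type(gene_id: str) -> str:
    """Classify one identifier as ENSEMBL, ENTREZID, or SYMBOL."""
    if ENSEMBL.match(gene_id):
        return "ENSEMBL"
    if ENTREZ.match(gene_id):
        return "ENTREZID"
    return "SYMBOL"


def id_types_in(genes: list) -> Counter:
    """Count identifier types in a list; more than one key signals mixed IDs."""
    return Counter(guess_id_type(g.strip()) for g in genes if g.strip())


genes = ["TP53", "BRCA1", "7157", "ENSG00000141510"]
print(id_types_in(genes))
# prints Counter({'SYMBOL': 2, 'ENTREZID': 1, 'ENSEMBL': 1})
```

A list that yields more than one identifier type is worth cleaning before analysis, since mixed types can lead to errors or dropped genes.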
Use provided example datasets:
- Preloaded example files demonstrate the expected input structure.
- Running these examples helps verify that preprocessing, method selection, and plotting work as intended.
Tune the appearance filter carefully:
- A higher minimum appearance threshold increases the reliability of results but may exclude meaningful genes present in fewer lists.
- A lower threshold includes more data but may increase background noise or false positives.
- Adjust this parameter based on dataset size and study goals.
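One practical way to choose the threshold is to check how many genes would survive at each possible value before running the analysis. The Python sketch below does this on toy lists. `genes_kept_per_threshold` is a hypothetical helper and is not part of the app.

```python
from collections import Counter


def genes_kept_per_threshold(lists: list) -> dict:
    """For each threshold t, count the genes that appear in at least t lists."""
    # set(genes) ensures a gene is counted once per list even if duplicated
    counts = Counter(gene for genes in lists for gene in set(genes))
    return {
        t: sum(1 for c in counts.values() if c >= t)
        for t in range(1, len(lists) + 1)
    }


lists = [
    ["TP53", "BRCA1", "EGFR"],
    ["BRCA1", "TP53"],
    ["TP53", "MYC"],
]
print(genes_kept_per_threshold(lists))  # prints {1: 4, 2: 2, 3: 1}
```

A sharp drop between two consecutive thresholds marks the point where stringency starts to cost many genes.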
Set the enrichment gene count wisely:
- When running enrichment analysis, the number of top-ranked genes selected (e.g., 10 to 100) strongly impacts both result relevance and runtime.
- Choose this number based on the biological context and size of the ranked list.
Inspect excluded genes and terms:
- If expected results are missing from the output, consult the excluded genes/terms tables.
- This can reveal whether important items were filtered out due to low appearance or other criteria.
Customize visualizations for clarity:
- Modify axis labels, font sizes, color schemes, and number of top genes or terms to improve readability.
- Interactive .HTML exports are ideal for in-depth exploration, while .PNG and .JPG are publication-ready.
Monitor computational performance:
- Large-scale analyses (e.g., >30 input lists or >1000 terms) may increase memory and runtime demands.
- To optimize performance, reduce the number of input lists, narrow enrichment filters, or raise the minimum appearance threshold.
8 Troubleshooting
Below is a list of common issues users may encounter during analysis, along with suggested solutions and interpretations.
| Issue | Possible Solution |
|---|---|
| File Format Error | This applies to both RankProd and RRA input files. Ensure that files are in .txt or .csv format without headers. For RankProd, each file must contain two columns: a gene identifier and a numeric statistic (e.g., logFC or fold change), separated by tabs or spaces. For RRA, the file must contain a single column of gene identifiers (one per line). Avoid special characters such as commas, semicolons, or #. Use info modals to preview accepted formats. |
| Paste Format Error (RRA) | When pasting data for RRA, each list must be separated by a line containing only ###, and each gene must be on a separate line. Do not include headers or special characters. Avoid using tabular formatting or lists copied from spreadsheets. |
| Paste Format Error (RankProd) | For RankProd, pasted input should be structured with two columns per list, separated by tab or space: gene ID and statistic. Ensure there are no headers and that all values in the second column are numeric. Lines separating lists must include ###. |
| Invalid origin Field | This applies to RankProd input only. Several possible issues can arise: (1) empty origin values (no data after ###), (2) wrong separator (e.g., commas or semicolons instead of tab or space), (3) using non-numeric values (e.g., letters instead of logFC), (4) mismatch between number of provided origin values and number of lists, (5) missing replicates. The app will inform users of the specific problem encountered. |
| Invalid Organism | If the selected organism does not match the gene nomenclature used in the input, mapping may fail. Ensure that gene identifiers follow expected conventions: human (TP53, BRCA1), mouse (Trp53, Brca1), rat, etc. Use the dropdown to match the correct species code (e.g., Hsa, Mmu). |
| Invalid Gene Identifiers | Ensure consistency between the selected gene ID type and the actual format of your genes. SYMBOL = gene names like TP53; ENTREZID = numeric-only IDs; ENSEMBL = IDs like ENSG00000.... Mixing types can lead to errors or dropped genes. |
| No Enrichment Results | If enrichment tables return empty, it is likely due to a very strict filter or insufficient number of shared genes across lists. Try lowering the “Minimum Number of Lists” threshold and verify that the selected genes have known annotations. |
| Appearance Threshold Too High | If no genes remain after filtering, the “Minimum Number of Lists” threshold may be too restrictive. Reduce this value to allow inclusion of genes present in fewer lists. |
| No Repeated Genes Across Lists | If each list contains unique genes with no overlap, no consensus ranking or enrichment will be possible. Consider whether the gene lists are comparable and whether overlaps exist. |
| Slow Performance | Performance issues may arise when processing more than 30 lists or more than 1000 terms. To improve speed: reduce the number of lists, use fewer genes per list, or raise the “Minimum Number of Lists” threshold. Also consider simplifying visualization parameters. |
Table 10: Common errors that may occur in MetaRank and their possible solutions.
In addition to the summarized issues and suggestions presented in the table above, the following section visually illustrates the main error messages that may appear during the use of the application. Each message corresponds to a common input or processing issue, offering a brief explanation of what went wrong, how it affects the workflow, and what can be done to resolve it. In some cases, specific files or pieces of information that caused the problem are clearly indicated to help the user correct the input efficiently and continue without interruption.
Figure 19: Pop-up window displayed when an error related to the file format is detected.
Figure 20: Pop-up window displayed when an error related to the paste format (RRA) is detected.
Figure 21: Pop-up window displayed when an error related to the paste format (RP) is detected.
Figure 22: Pop-up window displayed when an error related to the origin format is detected.
Figure 23: Pop-up window displayed when an error related to the organism selection is detected.
Figure 24: Pop-up window displayed when an error related to the Gene ID selection is detected.
Figure 25: Pop-up window displayed when no enrichment results are found.
Figure 26: Pop-up window displayed when no genes are shared across the number of lists required by the Appearance Threshold.
Figure 27: Pop-up window displayed when no common genes are found.